As a reminder, I’m investigating Congressional bills from the 112th through 115th Congresses; within that group, I’m looking specifically at bills that passed the House in each of those Congresses. The data were sourced from three places: ProPublica’s Congress API, govtrack.us, and voteview.com (for DW-NOMINATE scores). With the data collected and (for the most part) pre-processed in Python (see the script here if you haven’t already), I next turned to R to understand and analyze the data.
To begin, I’ve included the pre-processed data in tabular form below. Recoding in R included parsing dates as Date objects, converting between nominal and numeric attributes, and normalizing string case.
require(tidyverse)
# Read in the Python-processed data and drop unneeded columns
data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/Processed with No Text.csv")
data <- subset(data, select = -c(1,7,9,11,17))
# Bills with no cosponsors from a given party were coded NA; recode those to 0
data$cospons_r[is.na(data$cospons_r)] <- 0
data$cospons_d[is.na(data$cospons_d)] <- 0
data$cospons_i[is.na(data$cospons_i)] <- 0
# Parse dates as Dates and normalize subject capitalization to title case
data$introduced_date <- as.Date(data$introduced_date)
data$primary_subject <- tolower(data$primary_subject)
data$primary_subject <- str_to_title(data$primary_subject, locale = "en")
names(data) <- c("bill_id","bill_slug","bill_type","committees",
"cosponsors","introduced_date","primary_subject",
"sponsor_id","sponsor_name","sponsor_party",
"sponsor_state","sponsor_title",
"congress","dw_nom_1",
"dw_nom_2","sponsor_gender","sponsor_twitter",
"sponsor_leadership_role","sponsor_seniority",
"sponsor_party_loyalty","sponsor_district",
"sponsor_age","cosponsors_r","cosponsors_d",
"cosponsors_i","bill_len","bill_avg_word_len",
"bill_num_stopwords","bill_num_numerics",
"bill_num_usc_refs","result")
# Numeric party coding for correlation analysis: R = -1, I = 0, D = 1
data$sponsor_party <- as.character(data$sponsor_party)
data$sponsor_party_n[data$sponsor_party == "R"] <- -1
data$sponsor_party_n[data$sponsor_party == "I"] <- 0
data$sponsor_party_n[data$sponsor_party == "D"] <- 1
# Numeric gender coding (factor levels as integers)
data$sponsor_gender_n <- as.numeric(data$sponsor_gender)
# TRUE if the sponsor holds any leadership role
data$sponsor_leadership <- !(data$sponsor_leadership_role == "")
# Collapse the six detailed outcomes into two broad fates
data$result_simplified <- NA
data$result_simplified[as.character(data$result) %in% c("Became law","Passed; not law (e.g. CR)","Vetoed")] <- "Made it through"
data$result_simplified[as.character(data$result) %in% c("Went to senate","Other","Didn't leave Congress")] <- "Languished in Congress"
data
Next, I’ve provided a few visualizations so we can get to know the data. First: what do the bills deal with? Each bill is assigned a primary subject, using categories established by the Congressional Research Service. Below, you can see that certain categories of bills are far more common than others. Bills in just two categories - those dealing with Congress and with “Government Operations and Politics” - make up over 30% of bills that passed the House in the 112th - 115th Congresses. Those categories cover general government oversight, operations, administration, elections, and ethics.
require(questionr)
# Tabulate primary-subject frequencies, then order subjects by count for plotting
subject_data_ordered <- subject_data <- freq(subset(data, select = c(7))$primary_subject)
subject_data_ordered$subject <- subject_data$subject <- rownames(subject_data)
subject_data_ordered$subject <- factor(subject_data$subject,
                                       levels = subject_data[order(subject_data$n), "subject"])
ggplot(subject_data_ordered, aes(x = subject, y = n)) +
geom_bar(stat = "identity", fill = "black") +
coord_flip() +
labs(title = " Figure 1: Frequencies of Bill Primary Subjects",
caption = "Categorized into 32 bins used by the Congressional Research Service,\nmore information at https://www.congress.gov/help/field-values/policy-area\n",
y = "", x = "")
Next: who is putting these bills forward? I have collected a variety of characteristics about bill sponsors, which I review below. First, in the figure below, the sponsors’ ages, seniority ranks, and ideologies are plotted. Seniority measures the number of years a member has served. dw_nom_1 and dw_nom_2 measure member ideology and are calculated from roll-call vote records; the first dimension captures the member’s position on government intervention in the economy, and the second captures the member’s positions on the salient social issues of the day, e.g. slavery in the early-to-mid 19th century and LGBTQ rights today.
As the figure shows, sponsors of bills in the relevant timeframe are polarized on the first (economic) dimension but generally share similar positions on the second (social) dimension. They also tend to be older; while the overall US population’s age distribution is bimodal (with peaks around ~30 and ~60), the younger group is underrepresented in this sample. That is to be expected, though, as Congress as a whole is older than the US population.
subset(data, select = c(14,15,19,22)) %>%
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density() +
theme_bw() +
labs(title = "Figure 2: Densities of Selected Continuous Attributes")
Next, I plotted a few discrete characteristics of the data; as the figure below shows, the number of bills that passed the House steadily increased from the 112th through the 115th Congress. These bills were overwhelmingly sponsored by men, and overwhelmingly sponsored by Republicans. The gender disparity is to be expected given the overall gender disparity in the House, though it might be amplified in this case by the party disparity, as the Republican House caucus is more heavily male than its Democratic counterpart.
The disparity in sponsorships by party is further explored in the second figure below, which shows that the ratio of Democratic- to Republican-sponsored bills was roughly constant across the four Congresses at hand. The stark imbalance in sponsorships by party is to be expected; the Republican Party held the majority in the House in each of these four Congresses. This fact makes comparisons within and among the four Congresses more sound; if the majority and leadership had swapped part-way through, many facets might have changed as a result. However, it also should be noted as a limit on the generalizability of any findings from this project; they apply only to Republican-held Houses, and really only to these specific Congresses, as so much in politics depends on temporal context.
subset(data, select = c(13,16,10)) %>%
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_bar(fill = "black") +
theme_bw() +
labs(title = "Figure 3: Distributions of Selected Categorical Attributes")
summarise(group_by(data, congress, sponsor_party), bill_count = n()) %>%
ggplot(aes(fill = sponsor_party, x = congress, y = bill_count)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_manual(name = "Sponsor Party",
values = c("#3487BD","black","#D63E50")) +
labs(title = "Figure 4: Bills Sponsored by Congress and Sponsor Party",
y = "Proportion of Bills Sponsored\n",
x = "Congress")
Finally, I examined the characteristics of the bill texts themselves, which I calculated through text processing in Python. The distributions of those characteristics are plotted below, with logarithmic scales on the x-axes due to the extreme right skew of the distributions. This makes the bill_avg_word_len panel slightly unorthodox, but for speed of coding I used a single facet_wrap() call that applies the same scale type to every panel - a decent compromise, since you can still understand what the bill_avg_word_len panel is getting across.
The characteristics plotted below are the bill’s overall length (bill_len), its average word length (bill_avg_word_len), and its counts of stopwords (bill_num_stopwords), numeric tokens (bill_num_numerics), and references to the U.S. Code (bill_num_usc_refs).
subset(data, select = c(26:30)) %>%
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density() + # as density
scale_x_log10() +
theme_bw() +
labs(title = "Figure 5: Densities of Selected Characteristics: Bill Text")
With an understanding of each variable itself, I next turned to gaining an understanding of the relationships between variables. In the figure below, I have plotted the correlations between each variable-pair in the dataset; the size and color of each circle represents the magnitude and direction of any correlation between the relevant two variables. The deeper the color and the larger the circle, the stronger the relationship; if the circle is blue, the correlation is positive/direct, and if the circle is red, the correlation is negative/inverse.
Several top-line conclusions can be drawn from this figure:
- The cosponsors* variables, except cosponsors_i, are strongly related to each other - even cosponsors_d and cosponsors_r are strongly, positively correlated. As the total number of cosponsors increases, the number of cosponsors in each party also generally increases. The nonexistent correlation of cosponsors_i with any of the other cosponsors* variables is likely due to the fact that there are simply barely any independent cosponsors in the dataset at all (54 total, out of over 75,000 cosponsorships across all bills).
- The text-characteristic variables (bill_num_stopwords, bill_num_usc_refs, and bill_num_numerics) are raw counts, which should increase as the overall bill text increases in length (bill_len). Average word length, however, should have little relationship to how long the document is.
- The sponsor’s numeric party code (sponsor_party_n) is inversely related to the sponsor’s economic-ideology score (dw_nom_1). A decrease in dw_nom_1 is associated with supporting greater government intervention in the economy, i.e. supporting traditionally-Democratic proposals, while an increase in sponsor_party_n represents moving towards the Democratic end of the scale (Republican = -1, Independent = 0, Democratic = 1). Therefore, this inverse relationship makes sense.
- Older sponsors tend to have lower dw_nom_1 scores, i.e. prefer more government intervention in the economy. This is in contrast to the pattern in the wider US population, where getting older usually predicts having more conservative/libertarian economic positions.
cor.mat <- cor(subset(data, select = c(5,13:15,19,26:30,32:33)))
cor.mat.rounded <- round(cor.mat, 2)
require(corrplot)
corrplot(cor.mat.rounded, type = "lower", order = "AOE",
         number.cex = 0.7, tl.cex = 0.8, tl.srt = 0.01, tl.col = "black",
         col = colorRampPalette(c("#3487BD","white","#D63E50"))(200),
         title = "Figure 6: Matrix of Correlations Between Numeric Attributes")
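To pull concrete numbers out of the matrix behind Figure 6, a small helper like the following can rank the variable pairs by correlation strength. This is a sketch, not part of the original analysis; it assumes the cor.mat.rounded object computed above.

```r
# Flatten the correlation matrix into pairs and rank by absolute correlation.
cor_pairs <- as.data.frame(as.table(cor.mat.rounded))
names(cor_pairs) <- c("var1", "var2", "correlation")
# Keep each unordered pair once and drop self-correlations
cor_pairs <- subset(cor_pairs, as.character(var1) < as.character(var2))
# Ten strongest relationships, positive or negative
head(cor_pairs[order(-abs(cor_pairs$correlation)), ], 10)
```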
The final attribute in the data that will prove crucial in later analysis is the ultimate fate of each bill - did it make it through Congress, get signed by the President, and become law? Of course, reality is a bit more complex than that, and there are more possible outcomes than either dying in the House or fully becoming law; these are the six outcomes into which I recoded the bills: “Became law”, “Passed; not law (e.g. CR)”, “Vetoed”, “Went to senate”, “Other”, and “Didn’t leave Congress”.
In addition to this six-level categorical variable, I coded a dichotomous version, result_simplified, which combined the six categories listed above into the following two: “Made it through” (“Became law”, “Passed; not law (e.g. CR)”, and “Vetoed”) and “Languished in Congress” (“Went to senate”, “Other”, and “Didn’t leave Congress”).
Clearly, there are distinct differences between bills that became law and those that were vetoed, but in this simple dichotomous split, the four bills that were vetoed are more like the other bills that also made it all the way through Congress than like those that didn’t make it out at all.
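As a quick sanity check on the two-level recode (a base-R sketch; the counts depend on the data):

```r
# Counts of bills per simplified fate; useNA flags any results the recode missed
table(data$result_simplified, useNA = "ifany")
```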
In the figure below, I show the breakdown of bill fates within each of the 32 primary subject categories I reviewed above.
summarise(group_by(data, result, primary_subject), bill_count = n()) %>%
ggplot(aes(fill = result, x = primary_subject, y = bill_count)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip() +
scale_fill_brewer(palette = "Spectral", guide = guide_legend(reverse = TRUE), name = "Result") +
labs(y = "Proportion of Bills within Subject Area",
title = "Figure 7: Fates of Bills that Passed the House\nin the 112th - 115th Congresses",
x = "") +
theme(plot.title = element_text(hjust = 0.5))
#https://machinelearningmastery.com/machine-learning-in-r-step-by-step/
write.csv(data, "C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/R_Processed_Data.csv")
#Abandoned attempt at building the models in R with caret; I ultimately used KNIME instead
#require(caret)
#complete_data <- data[complete.cases(data),]
#validation_index <- createDataPartition(complete_data$bill_id, p = 0.85, list = F)
#validation_data <- complete_data[-validation_index,]
#training_data <- complete_data[validation_index,]
#control <- trainControl(method = "cv", number = 10)
#metric <- "Accuracy"
#set.seed(7)
#model_lda <- train(result_simplified~., data = training_data, method = "lda", metric = metric, trControl = control)
#model_knn <- train(result_simplified~., data = training_data, method = "knn", metric = metric, trControl = control)
#model_rf <- train(result_simplified~., data = training_data, method = "rf", metric = metric, trControl = control)
With a firm understanding of the data, I could now turn to attempting predictions. For that, I turned to KNIME; I attempted to use both R and Weka at different points, but encountered more obstacles with both of those platforms’ machine learning tools than with KNIME’s.
With the data I have, both supervised and unsupervised learning methods can yield interesting results. First, I undertook unsupervised learning, specifically clustering, as I had seen (as reviewed above) that there were certain groupings in the data that might form nice clusters. For that cluster analysis, I used k-Means clustering in KNIME. In order to prep the data for that algorithm, several steps had to be taken:
- sponsor_party => sponsor_democrat
- sponsor_gender => sponsor_female
- sponsor_leadership_role => sponsor_leadership
- cosponsors_d, cosponsors_r, and cosponsors_i were dropped, leaving cosponsors
- bill_num_stopwords and bill_num_numerics were dropped, leaving bill_len and bill_num_usc_refs
The clustering did not include the result variable, though I thought it might be interesting to see whether clusters appear which are similar to/predict the ultimate fate of the bills (I undertake that task more directly with the supervised learning models below). After performing the cluster analysis in KNIME, I ported the data back over to R to make tables and plots. The following table contains the mean value of each cluster on each of the attributes used in the analysis; put together, the table represents the seven centroids of the clusters in 12-dimensional space. From this table, we can see that there is more difference between the clusters on some attributes than on others. For example, there is a lot of variation between the clusters on bill_len, but not much variation between clusters on congress.
km_1_clusters <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/K-Means-Clusters1.csv")
km_1_clusters
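For readers who want to stay in R, the KNIME chain can be approximated with base R’s kmeans(). This is a sketch under assumptions, not the code that produced the table above: I am guessing at the exact attribute set (the variable names below are those from the R data frame, not the KNIME renames) and using z-score normalization, one of several normalization options.

```r
set.seed(625)  # kmeans uses random starting centroids
# Hypothetical attribute set; the actual KNIME analysis used 12 attributes
vars <- c("congress", "cosponsors", "bill_len", "bill_num_usc_refs",
          "dw_nom_1", "dw_nom_2", "sponsor_seniority", "sponsor_age",
          "sponsor_party_loyalty")
km_input <- scale(na.omit(data[, vars]))   # drop incomplete rows, then z-score
km_fit <- kmeans(km_input, centers = 7, nstart = 25)
round(km_fit$centers, 2)                   # cluster centroids, in standardized units
```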
In fact, when some of the most informative attributes from that cluster analysis are plotted below, it becomes clear that bill_len is an incredibly strong driver of the clustering.
require(plotly)
km_1_data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/K-Means-Output1.csv")
plot_ly(km_1_data, type = "scatter3d", mode = "markers", x = ~bill_len, y = ~dw_nom_2,
z = ~sponsor_seniority, color = ~Cluster, hoverinfo = 'text', text = ~row.ID, colors = "Spectral") %>%
layout(title = "Figure 8: k-Means Clustered Bills Passed in the House,\n112th - 115th Congresses")
The next models I developed for these data were supervised; specifically, random forest decision trees, X, and X.
I again built these models in KNIME. In contrast to the additional data processing that was necessary before the cluster analysis above, no changes were made to the data between export from R, import into KNIME, and running of the models. For this model set, I asked KNIME to predict the six-level result column - the more detailed fates of bills in the data set.
I built two RandomForest node chains in KNIME: one primary chain based on a simple 85/15 partition and one secondary chain with an X-Partitioner/X-Aggregator loop for 10-fold cross-validation. The model I discuss below is the one developed from the simple 85/15 partition, as that chain produced more detailed output - not only the consensus prediction, but also the number of models that agreed on that prediction and the proportion of models that predicted each possible outcome for each instance (think probabilistic modeling). I include that data, with those columns, below. I’ve also coded a match column, simplifying the six-level prediction column into a two-level boolean column.
rf_1_data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/RF_Data1.csv")
# match is TRUE where the RF consensus (out-of-bag) prediction equals the true result
rf_1_data['match'] <- (as.character(rf_1_data$result) == as.character(rf_1_data$result..Out.of.bag.))
rf_1_data
Overall, the model developed by the partitioned-RF chain had an accuracy of 76.8%. That performance differed by result category; the model was better at predicting the “Passed; not law” and “Went to senate” results and not as good at predicting the “Became law” and “Didn’t leave Congress” results. I base that conclusion on the F-scores for each result category, as reported by KNIME and included below. Stats for the “Vetoed” and “Other” categories were not reported.
rf_1_acc_stats <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/RF_AccStats1.csv")
rf_1_acc_stats
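The same per-class statistics can be recomputed in R directly from the prediction table. This is a hedged sketch assuming the column names shown in rf_1_data above (result for the true fate, result..Out.of.bag. for the RF consensus prediction); classes with no predictions will produce NaN.

```r
# Confusion matrix: rows = true result, columns = predicted result
conf <- table(truth = as.character(rf_1_data$result),
              pred  = as.character(rf_1_data$result..Out.of.bag.))
classes <- union(rownames(conf), colnames(conf))
# Per-class precision, recall, and F-score (harmonic mean of precision and recall)
stats <- sapply(classes, function(cl) {
  tp <- if (cl %in% rownames(conf) && cl %in% colnames(conf)) conf[cl, cl] else 0
  precision <- tp / sum(conf[, colnames(conf) == cl])
  recall    <- tp / sum(conf[rownames(conf) == cl, ])
  c(precision = precision, recall = recall,
    f_score = 2 * precision * recall / (precision + recall))
})
round(t(stats), 3)
```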
The RandomForest model’s performance can also be visualized graphically, in addition to through tables. Below, I’ve graphed a simpler version of the stats from the table above; for each category, I calculated the percent of bills in which the RF model’s predicted category (the most common prediction from the 100 individual models) matched the true value. This supports the same conclusions drawn above: Went to Senate and Didn’t leave Congress had the greatest within-group accuracy; Other and Became law were predicted correctly less often - though still over 50% of the time - and the remaining two categories were predicted incorrectly 100% of the time, almost certainly due to their relatively minuscule sample sizes.
summarise(group_by(rf_1_data, result, match), bill_count = n()) %>%
ggplot(aes(fill = match, x = result, y = bill_count)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip() +
scale_fill_manual(guide = guide_legend(reverse = TRUE),
name = "Prediction",
labels = c("Incorrect","Correct"),
values = c("#D63E50","#3487BD")) +
labs(y = "Proportion of Bills within Result Category",
title = "Figure 9: RandomForest Performance By Bill Fate,\nBills that Passed the House in the 112th - 115th Congresses",
x = "") +
theme(plot.title = element_text(hjust = 0.5))
Another way to visualize the model’s performance is to examine its accuracy by primary subject area; was the model particularly better or worse at predicting the results of bills in certain subjects? The answer to that question is plotted below. Overall, the model performed relatively similarly across almost all subject areas. There are three outlier categories in which the model was either 100% correct or 100% incorrect in its predictions:
An additional outlier, Arts, Culture, and Religion, was not 100% one way or the other, but it had markedly lower accuracy than the remaining subject areas.
As we saw in Figure 1, each of these four subject areas is among the least frequent (together, they make up 4 of the bottom 6 categories by frequency). As such, it again follows that the model would perform worse in these categories, simply due to the small sample sizes it has to work with.
summarise(group_by(rf_1_data, primary_subject, match), bill_count = n()) %>%
ggplot(aes(fill = match, x = primary_subject, y = bill_count)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip() +
scale_fill_manual(guide = guide_legend(reverse = TRUE),
name = "Prediction",
labels = c("Incorrect","Correct"),
values = c("#D63E50","#3487BD")) +
labs(y = "Proportion of Bills within Subject Area",
title = "Figure 10: RandomForest Performance By Subject Area,\nBills that Passed the House in the 112th - 115th Congresses",
x = "") +
theme(plot.title = element_text(hjust = 0.5))
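The proportions behind Figure 10 can also be read directly as a table; a quick sketch, assuming the match column coded above:

```r
# Mean of the logical match column = within-subject accuracy rate
acc_by_subject <- aggregate(match ~ primary_subject, data = rf_1_data, FUN = mean)
acc_by_subject[order(acc_by_subject$match), ]  # least-accurate subjects first
```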
…